Graduate Research Assistantship (Sagar Mehta): Twitter Text Analysis
Business Problem:
Work Done:
In this step we import the required fundamental libraries:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
```
We have scraped data for the keyword Domestic Violence.
```python
dv = pd.read_csv('domestic violence_week1.csv')
dv.sample(5)
```
The dataset contains the following information:
| Field | Description |
|---|---|
| Datetime | The date and time information when the tweet was tweeted |
| Tweet Id | The unique tweet identification number |
| Text | The actual tweet |
| Username | The username who actually tweeted |
| Like Count | The count of likes a tweet got |
| Display Name | The actual name of the user |
| Language | The language in which the tweet was made |
The following steps were used to clean the data; they are essential when working with raw text:
```python
import re
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

def tweet_cleaner(text):
    text = re.sub(r'@\w+', '', str(text))                # remove @mentions
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)  # remove URLs (http and https)
    text = re.sub(r'[^a-zA-Z]', ' ', text)               # keep letters only
    lower_case = text.lower()
    words = tokenizer.tokenize(lower_case)
    return (" ".join(words)).strip()
```
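To illustrate the effect of these cleaning steps, here is a minimal, self-contained sketch applied to a made-up sample tweet (it uses a plain `str.split()` in place of NLTK's `WordPunctTokenizer`, so no NLTK download is needed):

```python
import re

def clean_sketch(text):
    text = re.sub(r'@\w+', '', str(text))                # drop @mentions
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)  # drop URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)               # keep letters only
    return " ".join(text.lower().split())                # lowercase, collapse whitespace

sample = "@user1 Domestic Violence Awareness Month! Read more: https://example.com/info #DVAM2021"
print(clean_sketch(sample))
# domestic violence awareness month read more dvam
```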
We further clean the data using NLTK, following these steps:
```python
import nltk
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer

# Tokenize function
def Tokenize(string):
    tokens = nltk.tokenize.word_tokenize(string)
    return " ".join(tokens)

# Remove-stopwords function
def RemoveStopWords(string):
    # Removing punctuation
    for each in punctuation:
        string = string.replace(each, "")
    # Removing stopwords
    english_stopwords = stopwords.words('english')
    stopwords_removed_tokens = []
    words = string.split(" ")
    for each in words:
        if each not in english_stopwords:
            stopwords_removed_tokens.append(each)
    return " ".join(stopwords_removed_tokens)

# Lemmatize function
def Lemmatize(string):
    word_lem = WordNetLemmatizer()
    words = string.split()
    lemmatizeWords = []
    for each in words:
        lemmatizeWords.append(word_lem.lemmatize(each))
    return " ".join(lemmatizeWords)

# Full cleaning pipeline: tokenize, drop stopwords, then lemmatize
def Refine(string):
    return Lemmatize(RemoveStopWords(Tokenize(string)))
```
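The effect of this pipeline can be sketched without downloading any NLTK corpora; the tiny stopword list and the trailing-'s' stripper below are hypothetical stand-ins for `stopwords.words('english')` and `WordNetLemmatizer`, used only to show the shape of the transformation:

```python
from string import punctuation

TINY_STOPWORDS = {'the', 'a', 'of', 'is', 'are', 'in', 'and'}  # stand-in for NLTK's list

def refine_sketch(text):
    # 1. strip punctuation (as in RemoveStopWords)
    for ch in punctuation:
        text = text.replace(ch, "")
    # 2. drop stopwords
    words = [w for w in text.split() if w not in TINY_STOPWORDS]
    # 3. crude lemmatization stand-in: strip a trailing plural 's'
    words = [w[:-1] if w.endswith('s') and len(w) > 3 else w for w in words]
    return " ".join(words)

print(refine_sketch("the victims of abuse are in shelters"))
# victim abuse shelter
```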
```python
# make a new dataframe with only the required columns
tweets = dv.loc[:, ['Datetime', 'Text']]
tweets['text_cleaned'] = tweets['Text'].apply(tweet_cleaner)
tweets.head()
tweets['clean-text_removed'] = tweets['text_cleaned'].apply(Refine)
tweets.head()
```
```python
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

image_mask = np.array(Image.open('USA MAP BLACK.png'))
text = tweets['clean-text_removed'].values
stopwords = set(STOPWORDS)

plt.figure(figsize=(15,15))
image_wc = WordCloud(background_color='black', max_words=250000, stopwords=stopwords,
                     mask=image_mask, colormap='twilight_shifted')
image_wc.generate(str(text))
plt.imshow(image_wc, interpolation='bilinear')
plt.axis("off")
plt.show()
```
The goal was to understand the presence of certain words and explore any correlation between these words and domestic violence.
```python
import re

keywords = ['Alcoholism', 'complications', 'Substance-related', 'disorders', 'complications',
            'Family relations', 'Spouse abuse', 'Substance abuse', 'Substance use',
            'Intimate Partner Violence', 'cycles of escalation', 'Duluth model',
            'batterer intervention programs', 'intimate partner violence', 'perpetrator',
            'recidivism', 'intimidation', 'threats', 'physical violence', 'sexual violence',
            'isolation', 'economic abuse', 'stalking', 'psychological abuse',
            'coercion related to mental health']

pattern = '|'.join(f"\\b{k}\\b" for k in keywords)  # whole words only
matches = {k: 0 for k in keywords}
for title in tweets['Text']:
    for match in re.findall(pattern, title):
        matches[match] += 1
print(matches)

match = pd.DataFrame(matches, index=[0])
match.T.sort_values(0).plot(kind='barh', legend=False,
                            figsize=(15,8), title='Keywords analysis')
plt.show()
```
From the keyword analysis above we come to the following conclusion:
```python
dv.groupby(['Username','Display Name'])['Text'].count().sort_values(ascending=False)[:10].plot(kind='barh')
plt.show()
```
The top 10 usernames are mostly organizations that support victims of domestic violence.
A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body.
Source: Wikipedia
We use LDA based Topic modeling.
LDA stands for Latent Dirichlet Allocation.
We import LDA from scikit-learn; see the scikit-learn documentation for details.
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

lda = LatentDirichletAllocation(n_components=5, random_state=42)
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(tweets['clean-text_removed'])
lda.fit(dtm)

for i, topic in enumerate(lda.components_):
    print(f"The top 15 words for topic #{i}")
    # in scikit-learn >= 1.0, use cv.get_feature_names_out() instead
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])

topic_results = lda.transform(dtm)
tweets['topic_result_lda'] = topic_results.argmax(axis=1)
tweets.sample(5)
```
```python
topic_mapping = {4: 'Discussion', 1: 'News', 3: 'Campaign and Law',
                 2: 'Help and services', 0: 'Arrests & Court Cases'}
tweets['topic'] = tweets.topic_result_lda.map(topic_mapping)
tweets.sample(5)

plt.figure(figsize=(15,8))
tweets.topic.value_counts().plot(kind='bar')
plt.xlabel('Topics')
plt.ylabel('counts')
plt.title('Different topics')
plt.xticks(rotation=0)
plt.show()
```
Conclusion from topic modeling:
We follow the same procedure to scrape and process tweets for the keyword substance abuse as we did for domestic violence.
```python
sa = pd.read_csv('substance abuse_week2.csv', lineterminator='\n')
sa.sample(5)

tweets = sa.loc[:, ['Datetime', 'Text']]
tweets['text_cleaned'] = tweets['Text'].apply(tweet_cleaner)
tweets.head()
tweets['clean-text_removed'] = tweets['text_cleaned'].apply(Refine)
tweets.head()

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

image_mask = np.array(Image.open('USA MAP BLACK.png'))
text = tweets['clean-text_removed'].values
stopwords = set(STOPWORDS)

plt.figure(figsize=(15,15))
image_wc = WordCloud(background_color='black', max_words=250000, stopwords=stopwords,
                     mask=image_mask, colormap='twilight_shifted')
image_wc.generate(str(text))
plt.imshow(image_wc, interpolation='bilinear')
plt.axis("off")
plt.show()
```
```python
import re

keywords = ['Alcoholism', 'complications', 'Domestic abuse', 'disorders', 'complications',
            'Family relations', 'Spouse abuse', 'domestic violence', 'Domestic Violence',
            'Intimate Partner Violence', 'intimate partner violence', 'perpetrator',
            'recidivism', 'intimidation', 'threats', 'physical violence', 'sexual violence',
            'isolation', 'stalking', 'psychological abuse', 'mental health']

pattern = '|'.join(f"\\b{k}\\b" for k in keywords)  # whole words only
matches = {k: 0 for k in keywords}
for title in tweets['Text']:
    for match in re.findall(pattern, title):
        matches[match] += 1
print(matches)

match = pd.DataFrame(matches, index=[0])
match.T.sort_values(0).plot(kind='barh', legend=False,
                            figsize=(15,8), title='Keywords analysis')
plt.show()
```
From this figure above we come to the following conclusion:
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

lda = LatentDirichletAllocation(n_components=5, random_state=42)
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(tweets['clean-text_removed'])
lda.fit(dtm)

for i, topic in enumerate(lda.components_):
    print(f"The top 15 words for topic #{i}")
    # in scikit-learn >= 1.0, use cv.get_feature_names_out() instead
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])

topic_results = lda.transform(dtm)
tweets['topic_result_lda'] = topic_results.argmax(axis=1)
tweets.sample(5)

topic_mapping = {0: 'Discussion', 1: 'Causes and effects', 2: 'Treatment and Remedies',
                 3: 'Awareness', 4: 'News'}
tweets['topic'] = tweets.topic_result_lda.map(topic_mapping)

plt.figure(figsize=(15,8))
tweets.topic.value_counts().plot(kind='bar')
plt.xlabel('Topics')
plt.ylabel('counts')
plt.title('Different topics')
plt.xticks(rotation=0)
plt.show()
```
Conclusion from topic modeling:
```python
sa.groupby(['Username','Display Name'])['Text'].count().sort_values(ascending=False)[:10].plot(kind='barh')
plt.show()
```
The top 10 usernames are mostly organizations that support victims of substance abuse.
Week 3
Scrape data for the keyword substance abuse using snscrape
The chunks below show dummy code for how to scrape using snscrape
```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating a list to append tweet data to
tweets_list2 = []

# Using TwitterSearchScraper to scrape data and append tweets to the list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
        'substance abuse since:2021-01-01 until:2021-03-25').get_items()):
    if i > 50:
        break
    tweets_list2.append([tweet.date, tweet.id, tweet.content])

# Creating a dataframe from the tweets list above
tweets_dfdummy = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text'])
```
```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating a list to append tweet data to
tweets_list2 = []

# Using TwitterSearchScraper to scrape data and append tweets to the list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
        'substance abuse since:2021-01-01 until:2021-03-25').get_items()):
    if i > 50000:
        break
    tweets_list2.append([tweet.date, tweet.id, tweet.content])

# Creating a dataframe from the tweets list above
tweets_df2 = pd.DataFrame(tweets_list2, columns=['Datetime', 'Tweet Id', 'Text'])
```
Scrape data for the keyword domestic violence using snscrape
```python
import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating a list to append tweet data to
tweets_list7 = []

# Using TwitterSearchScraper to scrape data and append tweets to the list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
        'domestic violence since:2021-01-01 until:2021-11-21').get_items()):
    if i > 200000:
        break
    tweets_list7.append([tweet.date, tweet.id, tweet.content])

# Creating a dataframe from the tweets list above
tweets_df7 = pd.DataFrame(tweets_list7, columns=['Datetime', 'Tweet Id', 'Text'])
```
After merging all the files, we save the result to a CSV file on the local machine and reload it.
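The merge itself can be done with `pandas.concat`. The sketch below uses small in-memory frames as stand-ins for the weekly scrape results (the de-duplication on `Tweet Id` is an assumption about how overlapping scrapes would be handled):

```python
import pandas as pd

# stand-ins for two weekly scrape results with one overlapping tweet
week1 = pd.DataFrame({'Tweet Id': [1, 2], 'Text': ['tweet one', 'tweet two']})
week2 = pd.DataFrame({'Tweet Id': [2, 3], 'Text': ['tweet two', 'tweet three']})

merged = (pd.concat([week1, week2], ignore_index=True)
            .drop_duplicates(subset='Tweet Id')
            .reset_index(drop=True))
print(len(merged))  # 3 unique tweets
# merged.to_csv('Substance abuse final.csv', index=False)  # save, then reload with pd.read_csv
```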
Read the dataframe and inspect
```python
sa = pd.read_csv('Substance abuse final.csv', lineterminator='\n')
dv = pd.read_csv('Domestic violence final.csv')
sa['Datetime'] = pd.to_datetime(sa['Datetime'])
```
Check for datatypes
sa.dtypes
Inspect first few columns
sa.head()
dv.head()
Select only relevant columns
dv_df = dv.loc[:,['date','Text']]
sa_df = sa.loc[:,['Date','Text']]
sa_df.head()
dv_df.head()
Data Preprocessing
dv_df['text_cleaned'] = dv_df['Text'].apply(tweet_cleaner)
sa_df['text_cleaned'] = sa_df['Text'].apply(tweet_cleaner)
dv_df.head(1)
sa_df.head(1)
More Data cleaning
dv_df['clean-text_removed'] = dv_df['text_cleaned'].apply(Refine)
sa_df['clean-text_removed'] = sa_df['text_cleaned'].apply(Refine)
sa_df.head()
dv_df.head()
Word cloud
Word cloud for Domestic violence
```python
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

image_mask = np.array(Image.open('Publication1-2.jpeg').convert("L"))
# image_colors = ImageColorGenerator(image_mask)  # only needed if recoloring the cloud below
text = dv_df['clean-text_removed'].values
```
```python
wc_dv = WordCloud(stopwords=STOPWORDS,
                  background_color="white",
                  mode="RGBA",
                  max_words=10000,
                  # contour_width=3,
                  repeat=True,
                  mask=image_mask,
                  colormap='brg')
wc_dv.generate(str(text))
# wc_dv.recolor(color_func=image_colors)
plt.figure(figsize=(15,15))
plt.imshow(wc_dv)
plt.axis("off")
plt.show()
```
Word cloud for Substance abuse
```python
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

image_mask = np.array(Image.open('sa.jpeg'))
image_colors = ImageColorGenerator(image_mask)
text = sa_df['clean-text_removed'].values

wc_sa = WordCloud(stopwords=STOPWORDS,
                  background_color="white",
                  mode="RGBA",
                  max_words=100000,
                  # contour_width=3,
                  repeat=True,
                  mask=image_mask,
                  colormap='gist_yarg')
wc_sa.generate(str(text))
plt.figure(figsize=(15,15))
plt.imshow(wc_sa)
plt.axis("off")
plt.savefig('sv.png')
plt.show()
```
What are n-grams?
N-grams are contiguous sequences of words, symbols, or tokens in a document; in technical terms, they are the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
| n | Term |
|---|---|
| 1 | Unigram |
| 2 | Bigram |
| 3 | Trigram |
| n | n-gram |
| Example | Type of n-gram |
|---|---|
| ['I', 'stay', 'in', 'Atlanta'] | Unigram |
| ['I stay', 'stay in', 'in Atlanta'] | Bigram |
| ['I stay in', 'stay in Atlanta'] | Trigram |
Source: Analytics Vidhya
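The n-grams in the table above can be generated with a few lines of plain Python by sliding a window of size n over the token list:

```python
def ngrams(tokens, n):
    # each n-gram is n consecutive tokens joined by spaces
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['I', 'stay', 'in', 'Atlanta']
print(ngrams(tokens, 2))  # ['I stay', 'stay in', 'in Atlanta']
print(ngrams(tokens, 3))  # ['I stay in', 'stay in Atlanta']
```

This mirrors what `nltk.ngrams` does in the analysis that follows, without the NLTK dependency.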
Ngrams
```python
# dv
plt.style.use('bmh')
import re
import nltk

def basic_clean(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english')
    text = text.lower()
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]
```
```python
words = basic_clean(''.join(str(dv_df['Text'].tolist())))
plt.figure(figsize=(15,8))
pd.Series(nltk.ngrams(words, 4)).value_counts()[:20].plot(kind='barh')
plt.ylabel('N = 4')
plt.title('N-grams(4) analysis')
plt.show()
```
The words that most often occur with domestic violence are awareness, month, sexual, and assault.
```python
words = basic_clean(''.join(str(sa_df['Text'].tolist())))
plt.figure(figsize=(15,8))
pd.Series(nltk.ngrams(words, 4)).value_counts()[:20].plot(kind='barh')
plt.ylabel('N = 4')
plt.title('N-grams(4) analysis')
plt.show()
```
Since we wanted to understand the relation between substance abuse and domestic violence, we passed just these keywords to check their frequency counts.
```python
import re

# frequency of substance-abuse keywords in the domestic-violence tweets
keywords = ['domestic violence', 'substance abuse', 'abuse']
pattern = '|'.join(f"\\b{k}\\b" for k in keywords)  # whole words only
matches = {k: 0 for k in keywords}
for title in dv_df['Text']:
    for match in re.findall(pattern, str(title)):
        matches[match] += 1
print(matches)

# frequency of domestic-violence keywords in the substance-abuse tweets
keywords = ['domestic violence', 'substance abuse', 'violence']
pattern = '|'.join(f"\\b{k}\\b" for k in keywords)  # whole words only
matches = {k: 0 for k in keywords}
for title in sa_df['Text']:
    for match in re.findall(pattern, str(title)):
        matches[match] += 1
print(matches)
```
What Is a Time Series?
A time series is a sequence of data points recorded in successive order over a period of time. This can be contrasted with cross-sectional data, which captures a point in time. In investing, a time series tracks the movement of chosen data points, such as a security's price, over a specified period, with data points recorded at regular intervals.
Source: Investopedia
```python
dv_df['date'] = pd.to_datetime(dv_df['date'])
df_dv_ts = dv_df.groupby('date')['Text'].count().reset_index()

plt.style.use('fivethirtyeight')
plt.figure(figsize=(20,10))
sns.lineplot(x='date', y='Text', data=df_dv_ts)
plt.show()
```
```python
sa_df['Date'] = pd.to_datetime(sa_df['Date'])
sa_df_ts = sa_df.groupby('Date')['Text'].count().reset_index()

plt.style.use('fivethirtyeight')
plt.figure(figsize=(20,10))
sns.lineplot(x='Date', y='Text', data=sa_df_ts)
plt.show()

# overlay the two daily tweet-count series for 2021
df2021 = sa_df_ts[sa_df_ts['Date'] > '2020-12-31']
plt.style.use('fivethirtyeight')
plt.figure(figsize=(15,8))
ax1 = sns.lineplot(x='Date', y='Text', data=df2021, label='Substance Abuse Tweets')
ax2 = sns.lineplot(x='date', y='Text', data=df_dv_ts, label='Domestic violence Tweets')
legend = ax1.legend(loc='upper left')
plt.ylabel('Tweet count')
plt.title('Domestic Violence vs Substance Abuse tweets')
plt.show()
```
What is Prophet?
Prophet uses a decomposable time series model with three main components, trend, seasonality, and holidays, which are combined in the following equation:

y(t) = g(t) + s(t) + h(t) + ε_t

where g(t) is the trend function, s(t) captures periodic (e.g. weekly and yearly) seasonality, h(t) represents the effects of holidays, and ε_t is an error term accounting for any unusual changes not accommodated by the model. Using time as a regressor, Prophet fits several linear and non-linear functions of time as components.
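The decomposition can be illustrated with a synthetic series in NumPy; the component shapes below (linear trend, weekly sine seasonality, quarterly "holiday" spikes) are illustrative choices, not Prophet's actual fitted functions:

```python
import numpy as np

t = np.arange(365)                                    # one year of daily time steps
g = 100 + 0.1 * t                                     # g(t): linear trend
s = 10 * np.sin(2 * np.pi * t / 7)                    # s(t): weekly seasonality
h = np.where(t % 90 == 0, 25.0, 0.0)                  # h(t): occasional "holiday" spikes
eps = np.random.default_rng(42).normal(0, 2, t.size)  # ε_t: irregular noise

y = g + s + h + eps  # y(t) = g(t) + s(t) + h(t) + ε_t
print(y.shape)  # (365,)
```

Prophet's job is the inverse: given only `y`, recover the trend, seasonal, and holiday components.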
from fbprophet import Prophet
from fbprophet.plot import plot_plotly
import plotly.offline as py
```python
df1 = sa_df_ts.rename(columns={'Date': 'ds', 'Text': 'y'})
df1.head()
df2 = df_dv_ts.rename(columns={'date': 'ds', 'Text': 'y'})
df2.head()
```
Substance abuse
my_model = Prophet(interval_width=0.95,daily_seasonality = True)
my_model.fit(df1)
future_dates = my_model.make_future_dataframe(periods=12, freq='MS')
future_dates.head()
forecast = my_model.predict(future_dates)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()
my_model.plot(forecast, uncertainty=True)
Changepoints
Changepoints are the datetime points at which the time series has abrupt changes in trajectory. By default, Prophet places 25 potential changepoints in the initial 80% of the dataset.
plt.style.use('default')
from fbprophet.plot import add_changepoints_to_plot
fig = my_model.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), my_model, forecast)
pro_change= Prophet(n_changepoints=20, yearly_seasonality=True)
forecast = pro_change.fit(df1).predict(future_dates)
fig= pro_change.plot(forecast);
a = add_changepoints_to_plot(fig.gca(), pro_change, forecast)
Domestic violence
my_model = Prophet(interval_width=0.95,daily_seasonality = True)
my_model.fit(df2)
future_dates = my_model.make_future_dataframe(periods=12, freq='MS')
future_dates.head()
forecast = my_model.predict(future_dates)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head()
my_model.plot(forecast, uncertainty=True)
plt.style.use('default')
from fbprophet.plot import add_changepoints_to_plot
fig = my_model.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), my_model, forecast)
```python
pro_change = Prophet(n_changepoints=20, yearly_seasonality=True)
forecast = pro_change.fit(df2).predict(future_dates)  # fit on df2, the domestic-violence series
fig = pro_change.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), pro_change, forecast)
```
Domestic Violence Organizations
We researched organizations that help victims of domestic violence and identified 100 of them.
df = pd.read_excel('Domestic Violence Associations.xlsx',engine='openpyxl')
df
```python
import plotly.express as px

fig = px.sunburst(df, path=['Category', 'Type of Organizations', 'Organizations'],
                  color_continuous_scale='RdBu', width=800, height=1000)
fig.show()
# fig.write_html("Association.html")
```
Future work:
For the dataset and other information related to the project, click on my name, Sagar Mehta, to find the link to my GitHub profile, where all project materials are available.